Segmentation-based cardiomegaly detection based on semi-supervised estimation of cardiothoracic ratio

The successful integration of neural networks in a clinical setting is still uncommon despite major successes achieved by artificial intelligence in other domains. This is mainly due to the black box characteristic of most optimized models and the undetermined generalization ability of the trained architectures. The current work tackles both issues in the radiology domain by focusing on developing an effective and interpretable cardiomegaly detection architecture based on segmentation models. The architecture consists of two distinct neural networks performing the segmentation of both cardiac and thoracic areas of a radiograph. The respective segmentation outputs are subsequently used to estimate the cardiothoracic ratio, and the corresponding radiograph is classified as a case of cardiomegaly based on a given threshold. Due to the scarcity of pixel-level labeled chest radiographs, both segmentation models are optimized in a semi-supervised manner. This results in a significant reduction in the costs of manual annotation. The resulting segmentation outputs significantly improve the interpretability of the architecture’s final classification results. The generalization ability of the architecture is assessed in a cross-domain setting. The assessment shows the effectiveness of the semi-supervised optimization of the segmentation models and the robustness of the ensuing classification architecture.

The remainder of the work is organized as follows. In the "Materials and methods" section, a description of the data sets involved in the current study is provided, followed by a thorough description of the segmentation-based cardiomegaly detection architecture. A description of the experimental settings, as well as the performed experiments with the corresponding results, is subsequently provided in the "Experimental settings and results" section. Next, a discussion of the depicted results is provided in the "Discussion" section, before the work is finally concluded in the "Conclusion" section, with a brief summary of the main findings of the study, as well as an outlook on potential future works.

Materials and methods
In the current work, the assessment of the proposed cardiomegaly detection architecture is performed in a cross-domain setting. More specifically, the overall generalisation ability of the designed classification architecture is assessed by performing the optimisation of a model using a specific data set (referred to as training set), and subsequently evaluating the optimised model on additional data sets stemming from different domains (other than the one specific to the training set). In the specific case of cardiomegaly detection from chest X-ray images, this can be done by using a data set stemming from a specific institution to optimise the model, and performing the subsequent evaluation of the trained model on data stemming from different institutions or collected using different protocols and hardware. This form of evaluation, although particularly challenging due to the difference between the data distribution of the training set and the one of the evaluation set (also known as domain shift), constitutes a rather robust assessment of the generalization ability of the designed and optimised classification architecture. In the following, a description of the specific data sets used throughout the current work is provided, followed by a thorough description of the designed cardiomegaly detection architecture.

Chest X-ray data sets
The Japanese Society of Radiological Technology (JSRT) database 24 is a publicly available data set consisting of a total of 247 posteroanterior (PA) chest radiographs (100 with malignant pulmonary nodules, 54 with benign pulmonary nodules and 93 without a nodule) of 2048 × 2048 pixels resolution with a 0.175 millimeter (mm) pixel size and a 12-bit depth, collected from 13 medical centers in Japan and 1 additional institution in the United States. Manually generated segmentation masks for the lungs, heart and clavicles are provided for each image by the Segmentation in Chest Radiographs (SCR) database 25. Thus, the JSRT database is primarily used to optimise and assess multi-organ or single-organ segmentation models in a supervised learning setting 26,27.
The publicly available Pathology Detection in Chest Radiographs (PadChest) data set 28 consists of a total of 160,868 radiographs, stemming from 67,625 patients and recorded at the San Juan Hospital in Spain between 2009 and 2017. In contrast to the JSRT data set, no segmentation mask is available for the PadChest data set. Instead, the radiographs are annotated into a total of 170 distinct categories of radiographic findings (image-level labels), including cardiomegaly. The data set comprises chest X-ray images recorded in six different positions, including standing posteroanterior (PA) and lateral (L) views, anteroposterior (AP) supine and erect views, lordotic and oblique sternum views. Around 27% of the entire data set was manually annotated by trained physicians, and cases where no anomalies were found were subsequently annotated as normal. The remaining 73% of the data set was automatically annotated using an attention-based recurrent neural network (trained using the set of manually annotated X-ray images). The experiments in the current work are carried out based uniquely on the manually annotated radiographs recorded in a standing posteroanterior view. Furthermore, since the current study focuses on the detection of cases of cardiomegaly, the optimisation process is performed based on image samples labeled either as cases of cardiomegaly or as normal.
The Indiana University chest X-ray Collection (CXR OpenI) 29 is a publicly available data set, consisting of around 7470 manually labeled chest X-ray images, recorded in both lateral and posteroanterior views and stemming from various hospitals of the Indiana University School of Medicine. The data set is extracted from the National Library of Medicine (NLM) using the Open Access Biomedical Search Engine (OpenI) 30. Similarly to the PadChest data set, there is no segmentation mask available for the data set. Furthermore, the data retrieved for the assessment of the proposed cardiomegaly detection architecture consist of chest X-ray images recorded in a posteroanterior position and labeled either as cases of cardiomegaly or as normal.
A custom data set (CXR Ulm), consisting of manually annotated posteroanterior radiographs stemming from a total of 131 patients (31 female and 100 male) and collected within a study at the Department of Diagnostic and Interventional Radiology of the Ulm University Medical Center in Germany, is also used for the assessment of the proposed cardiomegaly detection architecture. The data stems from a study which was (i) approved by the Ethics Committee of the local Medical Faculty and the University Hospital (Confirmation number 115/21) and was also (ii) compliant with the Health Insurance Portability and Accountability Act (HIPAA) and conducted in accordance with the Declaration of Helsinki. Additionally, informed consent was waived by the local Ethics Committee based on the retrospective nature of the study. The annotation of the data set was performed by two trained radiologists, who provided not only a label for the detected pathology but also segmentation masks for both the lungs and the heart. The chest X-ray images were labeled as cases of cardiomegaly based on computed cardiothoracic ratios, with a fixed threshold of 0.55.
The National Institutes of Health Chest X-Ray Database (CXR NIH) 31 is a publicly available data set consisting of around 112,120 chest X-ray images, stemming from 30,805 patients and automatically annotated using different Natural Language Processing (NLP) techniques into either one or several categories of a total of 14 thoracic pathologies (including cardiomegaly). In cases where no pathologies were reported, the corresponding images were labeled as normal. There are also no segmentation masks available for this specific data set. Analogously to the previous data sets (PadChest, CXR OpenI, CXR Ulm), assessment experiments are performed using uniquely chest X-ray images belonging to both cardiomegaly and normal classes, and collected in a posteroanterior view. Therefore, all images belonging to the category of cardiomegaly are selected, and the same number of images is randomly selected from the set of images labeled as normal, in order to form the assessment set specific to this data set. A summary of the data distribution specific to each of these data sets is displayed in Table 1.
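The construction of the balanced CXR NIH assessment set described above (all cardiomegaly images plus an equally sized random draw of normal images) can be sketched as follows. This is only an illustrative sketch: the function name `balanced_subset`, the label strings and the fixed seed are assumptions introduced here, not details reported in the study.

```python
import random


def balanced_subset(labels, positive="cardiomegaly", negative="normal", seed=0):
    """Return sorted indices of all positives plus an equal number of random negatives.

    `labels` is a sequence of image-level labels; samples with any other label
    are ignored. The seed only makes this illustration reproducible.
    """
    pos = [i for i, lab in enumerate(labels) if lab == positive]
    neg = [i for i, lab in enumerate(labels) if lab == negative]
    if len(neg) < len(pos):
        raise ValueError("not enough negative samples to balance the subset")
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, len(pos)))


# Toy label list (hypothetical): 8 normal, 3 cardiomegaly, 2 other findings.
labels = ["normal"] * 8 + ["cardiomegaly"] * 3 + ["effusion"] * 2
subset = balanced_subset(labels)
```

Sampling without replacement keeps every selected normal image unique, so the resulting set contains exactly twice as many images as there are cardiomegaly cases.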
During the assessment of the proposed architecture, both JSRT and PadChest data sets are used as training sets, while the optimised architecture is subsequently evaluated on each of the remaining sets (CXR OpenI, CXR Ulm, CXR NIH). None of the samples specific to these evaluation sets are seen during the parameter optimisation of the classification architecture.

Methodology
The cardiomegaly detection architecture presented in the current work consists of a segmentation-based classification approach. As depicted in Fig. 1, the architecture comprises two distinct models (two neural networks), optimised to perform the segmentation of the lungs and heart areas respectively. Given an input image, each model generates a segmentation mask of the corresponding area of interest.
Bounding boxes around the resulting areas of interest are subsequently computed, followed by the computation of the corresponding cardiothoracic ratio (CTR) based on the widths of both bounding boxes. Finally, based on a specific threshold (π), the input image is classified either as a case of cardiomegaly (CTR > π) or as normal (CTR ≤ π), as described in Equation 1:

$$\text{class}(x) = \begin{cases} \text{cardiomegaly}, & \text{if } CTR > \pi \\ \text{normal}, & \text{if } CTR \leq \pi \end{cases} \tag{1}$$

The overall performance, as well as the generalization ability of this specific classification architecture, is inherently bound to the capacity of both models to accurately perform the segmentation of the corresponding areas of interest. In other words, the more accurate the resulting segmentation masks, the higher the classification performance of the architecture. Thus, a large amount of annotated data is needed in order to optimise both segmentation models effectively. In a supervised learning setting, each model is optimised based on a labeled set of images, consisting of chest X-ray images with the corresponding manually generated segmentation masks (pixel-level labels). However, manually annotated segmentation data for chest X-ray images are rather scarce (since such an annotation process is costly and time consuming), in contrast to the abundance of unlabeled chest X-ray data. Therefore, the optimisation of both segmentation models is performed in a semi-supervised learning setting, where a model is optimised based on two specific sets of data:

1. a set of labeled data $X_l = \{(x_1^l, y_1^l), \ldots, (x_n^l, y_n^l)\}$, where $x_i^l \in [0, 255]^{W \times H \times C}$ corresponds to the i-th chest X-ray image with a width $W$, a height $H$ and a total of $C$ channels, and $y_i^l \in [0, 1]^{W \times H}$ is the corresponding pixel-level label, with 0 corresponding to pixels specific to the background and 1 depicting pixels of the area of interest;
2. a significantly larger set of unlabeled data $X_u = \{x_1^u, \ldots, x_m^u\}$ (with $n \ll m$).
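The bounding-box-based CTR computation and the thresholding rule of Equation 1 can be illustrated with a minimal sketch. It assumes that each segmentation mask is given as a list of rows of binary values and that the bounding-box width is counted in pixels; the function names `bbox_width`, `cardiothoracic_ratio` and `classify` are hypothetical and only serve the illustration.

```python
def bbox_width(mask):
    """Width (in pixels) of the tightest bounding box around non-zero pixels."""
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not cols:
        raise ValueError("empty segmentation mask")
    return max(cols) - min(cols) + 1


def cardiothoracic_ratio(heart_mask, lungs_mask):
    """CTR as the ratio of the heart box width to the lungs box width."""
    return bbox_width(heart_mask) / bbox_width(lungs_mask)


def classify(ctr, threshold):
    """Cardiomegaly if CTR > threshold, normal otherwise."""
    return "cardiomegaly" if ctr > threshold else "normal"


# Toy 4 x 10 masks: the lungs span columns 1..8, the heart spans columns 3..7.
lungs = [[0, 1, 1, 1, 1, 1, 1, 1, 1, 0] for _ in range(4)]
heart = [[0, 0, 0, 1, 1, 1, 1, 1, 0, 0] for _ in range(4)]
ctr = cardiothoracic_ratio(heart, lungs)  # 5 / 8 = 0.625
label = classify(ctr, threshold=0.55)
```

Because only the horizontal extent of each mask enters the ratio, small errors along the left or right border of the heart mask translate directly into a CTR error, which is why the quality of the segmentation drives the classification result.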
Inspired by the work presented by Ouali et al. 23, cross-consistency training is applied in order to perform the optimisation of the models' parameters. An overview of the architecture and training procedure is depicted in Fig. 2. The architecture consists of a total of three neural networks: a shared encoder $E$, a main decoder $D$ and an auxiliary decoder $D_{aux}$. Images stemming from both labeled and unlabeled sets are simultaneously fed into the shared encoder $E$. The generated latent representations are subsequently fed into the two remaining neural networks. The representations specific to both labeled and unlabeled images are fed into the main decoder $D$.
Concurrently, a set of k stochastic perturbations is applied to each of the representations stemming from the set of unlabeled images, and the resulting altered representations are fed into the auxiliary decoder $D_{aux}$. Based on the output of the main decoder and the provided labels, the parameters of the main decoder are optimised using a supervised loss function $\mathcal{L}_S$. Meanwhile, the parameters specific to the auxiliary decoder are optimised based on an unsupervised loss function $\mathcal{L}_U$, computed based on its output and those from the main decoder stemming uniquely from the unlabeled samples. This is done in order to enforce a certain level of consistency between the output of the main decoder $D$ and the auxiliary decoder $D_{aux}$. The parameters of the shared encoder are optimised based on a weighted sum of both loss functions ($\mathcal{L}$), as follows:

$$\mathcal{L} = \mathcal{L}_S + \omega_U \mathcal{L}_U \tag{2}$$

where $\omega_U$ is a weighting function specific to the unsupervised loss. By enforcing the segmentation consistency between the output of the main decoder and the one of the auxiliary decoder regarding unlabeled chest X-ray images, the representations generated by the shared encoder are further enhanced by taking advantage of additional information stemming from unlabeled samples. Following the optimisation of the models, both the shared encoder and the main decoder are used to perform the segmentation of unseen samples during inference.

Figure 2. [...] ($z_i^l$ and $z_j^u$) are subsequently fed into the main decoder, which generates the segmentation masks for both labeled and unlabeled images ($\hat{y}_i^l$ and $\hat{y}_j^u$). Concurrently, a set of k distinctive perturbations ($P$) are applied to the latent representations specific to the unlabeled samples ($z_j^u$), and the resulting representations ($\{\hat{z}_j^{u,d}\}_{1 \leq d \leq k}$) are fed into the auxiliary decoder. The resulting set of auxiliary segmentation masks ($\{\hat{y}_j^{u,d}\}_{1 \leq d \leq k}$) are used in combination with the corresponding output of the main decoder ($\hat{y}_j^u$) to compute an unsupervised loss ($\mathcal{L}_U$), while the supervised loss ($\mathcal{L}_S$) is calculated based on the labeled samples' output of the main decoder ($\hat{y}_i^l$) and the corresponding labels ($y_i^l$).
In the current work, the supervised loss consists of the combination of a pixel-level classification loss and a segmentation loss, as depicted in Eq. (3), where $bs_l$ depicts the batch size for the set of labeled samples, $H$ represents the Binary Cross-Entropy (BCE) loss and $dice$ represents the Dice loss. Meanwhile, the unsupervised loss consists of the Mean Squared Error (MSE) between the output of the main decoder and those of the auxiliary decoder specific to the unlabeled set of images, as depicted in Eq. (4), where $bs_u$ depicts the batch size for the set of unlabeled samples and $d(\hat{y}_j^u, \hat{y}_j^{u,p})$ (see Eq. 5) represents the pixel-level squared error between both outputs $\hat{y}_j^u$ (from the main decoder $D$) and $\hat{y}_j^{u,p}$ (from the auxiliary decoder $D_{aux}$). The weighting function $\omega_U$ specific to the unsupervised loss corresponds to a Gaussian ramp-up function (see Eq. 6) as proposed by Laine and Aila 32, where $t$ depicts the current optimisation epoch and $L$ depicts the ramp-up length. The computed unsupervised loss weight slowly ramps up from 0 to 1 during the optimisation process, therefore reducing the impact of noisy segmentation outputs of the main decoder $D$ during the early phase of the optimisation process.

The perturbations applied to the latent representations specific to the unlabeled images consist of feature-based perturbations and random perturbations (as proposed by Ouali et al. 23). These perturbations have not only proven to be effective in areas such as semi-supervised semantic segmentation 23,33,34 and object localisation 35, but are also very simple to implement.
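To make the interplay of the loss terms more concrete, the following sketch shows one plausible instantiation on flattened masks. Several points are assumptions rather than details confirmed by the text: the supervised loss is taken as an unweighted sum of BCE and Dice averaged over the labeled batch, the unsupervised loss is averaged over the k auxiliary outputs and over the unlabeled batch, the ramp-up uses the form exp(-5 (1 - t/L)^2) commonly associated with Laine and Aila, and the Dice smoothing constant is arbitrary. All function names are hypothetical.

```python
import math


def gaussian_ramp_up(t, length):
    """Weight exp(-5 * (1 - t / length)^2) for t < length, and 1 afterwards."""
    if length <= 0 or t >= length:
        return 1.0
    phase = 1.0 - max(t, 0) / length
    return math.exp(-5.0 * phase * phase)


def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between flat label y and prediction p."""
    p = [min(max(v, eps), 1.0 - eps) for v in p]
    return -sum(a * math.log(b) + (1 - a) * math.log(1 - b)
                for a, b in zip(y, p)) / len(y)


def dice_loss(y, p, smooth=1.0):
    """Soft Dice loss; `smooth` (assumed value) avoids division by zero."""
    inter = sum(a * b for a, b in zip(y, p))
    return 1.0 - (2.0 * inter + smooth) / (sum(y) + sum(p) + smooth)


def supervised_loss(batch_y, batch_p):
    """Average of (BCE + Dice) over the labeled batch (assumed unweighted sum)."""
    terms = [bce(y, p) + dice_loss(y, p) for y, p in zip(batch_y, batch_p)]
    return sum(terms) / len(terms)


def squared_error(main, aux):
    """Pixel-level squared error between two flat masks, averaged over pixels."""
    return sum((a - b) ** 2 for a, b in zip(main, aux)) / len(main)


def unsupervised_loss(main_out, aux_out):
    """main_out[j] is a flat mask; aux_out[j] is the list of k auxiliary masks."""
    total = 0.0
    for main, auxes in zip(main_out, aux_out):
        total += sum(squared_error(main, a) for a in auxes) / len(auxes)
    return total / len(main_out)


def total_loss(l_s, l_u, epoch, ramp_length=50):
    """Weighted sum L_S + w_U(epoch) * L_U used for the shared encoder."""
    return l_s + gaussian_ramp_up(epoch, ramp_length) * l_u
```

With a ramp-up length of 50, the consistency term contributes almost nothing during the first epochs (the weight starts near exp(-5), roughly 0.007) and reaches its full weight from epoch 50 onwards.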
Feature-based perturbations consist of injecting random noise into the latent representation stemming from the shared encoder $E$. Two specific feature-based perturbations are applied to the latent representation in the current work:
• F-Noise: $\forall j,\ z_j^{u,1} = (z_j^u \odot N) + z_j^u$, with $N \sim \mathcal{U}(-0.3, 0.3)$ being a uniformly sampled random tensor of the same shape as $z_j^u$.
• F-Drop: $\forall j,\ z_j^{u,2} = z_j^u \odot M_{drop}$, where $M_{drop}$ represents a binary tensor of the same shape as $z_j^u$ obtained based on a uniformly sampled threshold $\gamma \sim \mathcal{U}(0.6, 0.9)$ and the normalized channel-wise averaged tensor of $z_j^u$, denoted $\bar{z}_j^u$, as follows: $M_{drop} = \mathbb{1}_{\bar{z}_j^u > \gamma}$.
Random perturbations consist of randomly dropping some of the activations of the latent representation. In the current work, Dropout is applied to the latent representation with a uniformly sampled dropout rate $r \sim \mathcal{U}(0.1, 0.7)$.
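The three perturbations can be sketched with NumPy on a latent tensor of shape (height, width, channels). This is a hedged illustration, not the implementation used in the study: the F-Noise form and its ±0.3 range follow the formulation of Ouali et al. and are assumptions here, the min-max normalisation of the channel-wise average is an assumption, the F-Drop inequality direction follows the text as written, and the inverted-dropout rescaling mirrors common framework behaviour. All function names are hypothetical.

```python
import numpy as np


def f_noise(z, rng, scale=0.3):
    """F-Noise: z * N + z with N drawn uniformly from [-scale, scale]."""
    noise = rng.uniform(-scale, scale, size=z.shape)
    return z * noise + z


def f_drop(z, rng, low=0.6, high=0.9):
    """F-Drop: mask z with a binary map thresholded at gamma ~ U(low, high)."""
    gamma = rng.uniform(low, high)
    avg = z.mean(axis=-1, keepdims=True)           # channel-wise average
    span = avg.max() - avg.min()
    norm = (avg - avg.min()) / (span if span > 0 else 1.0)  # scale to [0, 1]
    mask = (norm > gamma).astype(z.dtype)          # direction as in the text
    return z * mask


def random_dropout(z, rng, low=0.1, high=0.7):
    """Dropout with a rate drawn uniformly from [low, high]."""
    rate = rng.uniform(low, high)
    keep = rng.random(z.shape) >= rate
    return z * keep / (1.0 - rate)                 # inverted-dropout rescaling


rng = np.random.default_rng(0)
z = rng.random((4, 4, 8))                          # toy latent representation
perturbed = [f_noise(z, rng), f_drop(z, rng), random_dropout(z, rng)]
```

Each perturbed tensor keeps the shape of the original representation, so all of them can be passed to the same auxiliary decoder.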

Experimental settings and results
In the following section, a thorough description of the performed experiments is provided. First, the experimental settings are described, followed by a description and discussion of each performed experiment with its corresponding results.

Experimental settings
Before being fed into the designed architecture, chest X-ray images are pre-processed in order to significantly reduce the amount of noise within the images and homogenize the structure of the input data across different domains. In the current work, each image is first resized to the shape 299 × 299 × 3 and subsequently converted into a single-channel gray-scale image. Subsequently, Contrast Limited Adaptive Histogram Equalization (CLAHE) 36 is applied in order to enhance the contrast of the resulting gray-scale image. Next, a three-channel image is generated by replicating the single-channel contrast-enhanced image three times. Finally, the pixel values of the resulting image are normalized within the range [0, 1], by dividing each pixel value in each of the three channels by the maximum pixel value of 255.

Moreover, since the labeled set of data consists of a rather limited amount of chest X-ray images (the JSRT database consists of a total of 247 chest X-ray images with the corresponding pixel-level labels), data augmentation is performed (uniquely on the set of labeled images) by applying a set of geometrical transformations consisting of random horizontal and vertical flipping, random image rotation in a range of [0°, 10°], and a 10% image zoom-in. The transformations are applied on both chest X-ray images and the corresponding pixel-level labels in order to generate an increased amount of consistent labeled data.

Each segmentation model consists of an Encoder-Decoder network which takes as input a chest X-ray image and as label the corresponding pixel-level annotated image. In a semi-supervised learning setting, both decoders ($D$ and $D_{aux}$) have an identical architecture. While the parameters of the main decoder $D$ are optimised by using the supervised loss, the parameters of the auxiliary decoder $D_{aux}$ are optimised by using the unsupervised loss. The parameters of the encoder $E$ are optimised by a weighted sum of both supervised and unsupervised losses. A main component of the designed neural networks consists of a convolutional block, which comprises a 2-dimensional convolutional layer, followed by a Batch Normalization layer as a regularization approach and subsequently a Rectified Linear Unit (ReLU) activation function. The ensuing feature map is subsequently fed into an attention layer, consisting of the Convolutional Block Attention Module (CBAM) 37. The designed architectures of the encoder ($E$) and both decoders ($D$, $D_{aux}$) are depicted in Table 2.

Table 2. Neural networks' architectures. (Column headers: Layer, No. filters, Kernel size, Strides, Padding; further entries: Batch normalization, Activation: ReLU.)
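The pre-processing chain described at the start of this section (resize, gray-scale conversion, contrast enhancement, channel replication, normalization) can be sketched with NumPy as follows. The sketch makes several assumptions that are not stated in the text: nearest-neighbour resizing, gray-scale conversion by plain channel averaging, and a caller-supplied `enhance` hook standing in for CLAHE (for instance, the `apply` method of an object created with OpenCV's `cv2.createCLAHE` could be passed in). The function names are hypothetical.

```python
import numpy as np


def resize_nearest(img, size=299):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]


def preprocess(img, enhance=None, size=299):
    """Map a uint8 (H, W, 3) image to a float32 (size, size, 3) array in [0, 1]."""
    img = resize_nearest(img, size)
    gray = img.astype(np.float32).mean(axis=-1)        # assumed gray conversion
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    if enhance is not None:                            # e.g. a CLAHE routine
        gray = enhance(gray)
    stacked = np.repeat(gray[..., None], 3, axis=-1)   # replicate to 3 channels
    return stacked.astype(np.float32) / 255.0


# Toy input standing in for a radiograph.
raw = np.random.default_rng(0).integers(0, 256, size=(512, 400, 3), dtype=np.uint8)
out = preprocess(raw)
```

Keeping the contrast step as a pluggable function makes it easy to compare the pipeline with and without CLAHE on images from different domains.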
During the optimisation phase in a semi-supervised setting, a specific batch size ($bs$) is set, such that $bs = bs_l + bs_u$, where $bs_l$ represents the batch size specific to the set of labeled samples and $bs_u$ corresponds to the batch size specific to the set of unlabeled samples. Given $bs$, the values of $bs_l$ and $bs_u$ are computed as depicted in Eq. (7) (where $n$ is the number of labeled samples in the training set and $m$ is the number of unlabeled samples in the training set) and Eq. (8). Due to memory constraints, $bs$ is set to 16 in the current work.
Moreover, the optimisation is performed with a fixed learning rate set empirically to $10^{-3}$ for a total of 100 epochs. The ramp-up length $L$ (see Eq. (6)) is set to 50. The optimiser used throughout the current work consists of the Adaptive Moment Estimation optimisation algorithm (Adam) 38. During the optimisation phase, 20% of the set of labeled samples are used as validation set and the remaining 80% is used as training set. Concerning the set of unlabeled samples, 10% is used as validation set and the remaining 90% as training set. The Jaccard index (see Eq. (9)) is used as segmentation performance evaluation metric:

$$J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} \tag{9}$$

where $y$ represents a pixel-level annotated image (ground truth) and $\hat{y}$ the output of the decoder (prediction). Following the optimisation of both segmentation models (one for the heart and the other for both lungs), the performance of the cardiomegaly detection architecture (see Fig. 1) is assessed based on a set of classification performance metrics. The Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is also used as an additional performance evaluation metric. Thereby, cases of cardiomegaly are set as belonging to the positive class (positive instances), while normal samples are set as belonging to the negative class (negative instances) throughout the entirety of the performed experiments.

All implementations and evaluations performed in the current work were done with the libraries TensorFlow 39, Keras 40, and Scikit-learn 41. The optimisation process of each segmentation model was performed on a single Tesla V100 SXM2 Graphics Processing Unit (GPU) with 32 gigabytes of memory, with the Compute Unified Device Architecture (CUDA) version 11.4. The inference was subsequently performed on an M1 MacBook Pro with 16 gigabytes of memory.
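The Jaccard index used as segmentation metric is the ratio between the intersection and the union of the ground-truth and predicted foreground regions. A minimal sketch is given below; binarising both masks at 0.5 and returning 1.0 when both masks are empty are assumptions introduced for the illustration, and the function name `jaccard_index` is hypothetical.

```python
def jaccard_index(y_true, y_pred, threshold=0.5):
    """Intersection over union of two masks given as lists of rows."""
    inter = union = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            t_on, p_on = t >= threshold, p >= threshold
            inter += t_on and p_on      # pixel is foreground in both masks
            union += t_on or p_on       # pixel is foreground in at least one
    return inter / union if union else 1.0


truth = [[1, 1, 0], [0, 0, 0]]
pred = [[0.9, 0.2, 0.7], [0.0, 0.0, 0.0]]
score = jaccard_index(truth, pred)      # one shared pixel out of three
```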

Segmentation-based cardiomegaly detection
The first experiment consists of assessing the performance of the proposed segmentation-based cardiomegaly detection architecture in both supervised and semi-supervised settings, and comparing the performance of each optimisation approach based on the data sets CXR Ulm, CXR OpenI, and CXR NIH. In order to perform the classification task, a threshold of π = 0.55 is used for the CXR Ulm data set, which is the same threshold used by the physicians who performed the annotation of the data set. For both CXR OpenI and CXR NIH data sets, the threshold is set to π = 0.50. None of the images specific to these data sets have been seen during the optimisation process of the segmentation models. Even though such an assessment is rather challenging, it provides a realistic picture of the generalisation ability of the proposed classification approach. In a supervised learning setting, a segmentation model consists uniquely of the encoder $E$ and the main decoder $D$. Its optimisation is performed based uniquely on the set of labeled samples (the JSRT data set) and by using uniquely the supervised loss ($\mathcal{L}_S$). The remaining optimisation parameters are the same as in the case of the semi-supervised model optimisation. In a semi-supervised learning setting, the architecture is as described in Fig. 2 and the set of labeled instances consists of the JSRT data set, while the set of unlabeled instances consists of the PadChest data set. The results of the cardiomegaly detection task, based on segmentation models trained in both supervised and semi-supervised settings, are depicted in Table 3.
At a glance, the architecture consisting of segmentation models optimised in a semi-supervised learning setting systematically outperforms the one based on models optimised in a supervised learning setting for all data sets. Thus, the output of a model trained in a semi-supervised manner is more accurate than the one of a model trained in a supervised manner. This is also confirmed by the segmentation performance on both lungs and heart areas for the CXR Ulm data set: in a supervised learning setting the lungs' segmentation model achieves an averaged Jaccard score of 86.07%, while the heart's segmentation model achieves an averaged Jaccard score [...]. The cross-consistency training approach successfully extracts meaningful information from a set of unlabeled samples in order to enhance the latent representation stemming from the encoder and therefore significantly improves the resulting segmentation output. Learning uniquely from a significantly smaller set of labeled data leads to sub-optimal segmentation results, even after the application of data augmentation. Additionally, while considering the depicted classification results, one can see that the designed cardiomegaly detection architecture performs rather well in a cross-domain setting, since good classification performances can be observed across all the data sets. Thus, the designed architecture based on segmentation models trained in a semi-supervised manner exhibits a good generalisation ability.

Since the labels specific to the CXR Ulm data set are available, a visualization of some of the segmentation outputs with models trained in a semi-supervised manner is depicted in Fig. 3. The color blue is specific to the lungs while the color red refers to the heart. The displayed contours constitute the ground truth (or manually generated segmentation results), while the filled areas constitute the segmentation models' output (depicted in the top row of Fig. 3). Furthermore, the resulting bounding boxes, based on the segmentation models' output, are also displayed (in the bottom row of Fig. 3). While the lungs can be segmented relatively well, the heart area proved to be rather challenging. The segmentation model specific to the heart had more difficulties in performing an accurate segmentation in several cases. It was also observed that most of the misclassifications were due to an inaccurate segmentation of the heart. Therefore, it is believed that an improvement of the heart segmentation model should also result in an improvement of the overall performance of the cardiomegaly detection architecture.

Classification- vs. segmentation-based cardiomegaly detection
The next experiment consists of comparing the performance of the segmentation-based cardiomegaly detection approach to the one of a classification-based cardiomegaly detection approach. Based on the work presented by Thiam et al. 9, a model is trained using uniquely the PadChest data set with the corresponding image-level labels (one-hot encoding consisting of (1, 0) for normal and (0, 1) for cardiomegaly) in order to generate a classification-based cardiomegaly detection model using a transfer learning approach. The architecture of the model comprises a backbone consisting of pre-trained convolutional layers, followed by a single additional trainable convolutional layer and a subsequent global average pooling (GAP) layer. The backbone is generated by removing the top fully connected (FC) layers of a pre-trained deep neural network and freezing the remaining convolutional layers.
For the current experiments, the backbone consists of the InceptionV3 model 42 trained on the ImageNet database 43. Subsequent layers are added on top of the backbone: first a trainable convolution layer consisting of 1024 filters (3 × 3 kernels and 1 × 1 strides), followed by a Batch Normalization layer and a subsequent Rectified Linear Unit (ReLU) activation; the output is subsequently fed into a global average pooling (GAP) layer. The resulting feature representation is subsequently fed into a classifier consisting of two subsequent fully connected layers. The first layer uses a ReLU activation function with a total of 512 units and the second layer uses a Softmax activation function with a total of 2 units to generate the final output of the classification model. Regularization is performed in this case by placing Dropout layers with a fixed dropout rate of 0.25 between both fully connected layers, as well as between the GAP layer and the first fully connected layer. During the optimisation process, a fixed learning rate of $10^{-6}$ is used and the batch size is set to 16. The optimisation process goes on for a total of 200 iterations, using the Adam optimiser. In order to account for the imbalanced data distribution within the PadChest data set, samples are weighted per class based on the class counts, where $m = m^- + m^+$, with $m^- = |\{x_j^u \in X_u \mid y_j = (1, 0)\}|$ and $m^+ = |\{x_j^u \in X_u \mid y_j = (0, 1)\}|$. Following its optimisation, the classification model is subsequently applied on the testing sets CXR Ulm, CXR OpenI, and CXR NIH. The yielded results are summarised in Table 4, while the corresponding confusion matrices are depicted in Fig. 4.
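Class-dependent sample weighting for an imbalanced two-class problem can be illustrated with one common scheme, inverse class frequency, in which each class receives the weight m / (2 · m_c). This particular formula is an assumption chosen for illustration and is not necessarily the weighting used in the study; the function name `class_weights` and the dictionary keys are hypothetical.

```python
def class_weights(labels):
    """Inverse-frequency weights from one-hot labels (1, 0) / (0, 1).

    Assumed scheme: weight_c = m / (2 * m_c), so that both classes
    contribute equally to the loss on average.
    """
    m_neg = sum(1 for y in labels if tuple(y) == (1, 0))   # normal
    m_pos = sum(1 for y in labels if tuple(y) == (0, 1))   # cardiomegaly
    if m_neg == 0 or m_pos == 0:
        raise ValueError("both classes must be present")
    m = m_neg + m_pos
    return {"normal": m / (2 * m_neg), "cardiomegaly": m / (2 * m_pos)}


# Toy label set (hypothetical): three normal samples, one cardiomegaly sample.
toy_labels = [(1, 0), (1, 0), (1, 0), (0, 1)]
weights = class_weights(toy_labels)
```

Under this scheme the minority class receives the larger weight, and the weighted class counts are equal (here 3 × 2/3 = 1 × 2).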
It can be clearly seen that the segmentation-based cardiomegaly detection approach, based on segmentation models trained in a semi-supervised manner, outperforms the classification-based cardiomegaly detection approach for each of the testing sets, thus further pointing to the effectiveness of the proposed detection architecture. Similar results have been presented by Sogancioglu et al. 19, where the authors performed a comparison of both segmentation-based and classification-based cardiomegaly detection on a single data set. The results reported in that study also show that segmentation-based detection approaches outperform classification-based detection approaches (even though the experiments were conducted on a single data set). Furthermore, while the impact of the domain shift can be seen from the results specific to the classification-based detection approach, both segmentation-based detection approaches (in both supervised and semi-supervised settings) prove to be more robust in this regard by yielding better classification performances for almost all of the testing sets.

U-Net vs. semi-supervised segmentation based cardiomegaly detection
Based on the fact that most of the previous works rely on a U-Net model to perform the segmentation of specific areas of interest in a chest X-ray image, the last experiment consists of comparing the performance of the described segmentation-based cardiomegaly detection approach with segmentation models trained in a semi-supervised manner to the one of a cardiomegaly detection approach based on U-Net models. The encoder of the U-Net models consists of the frozen pre-trained convolutional layers of an InceptionV3 model (optimised on the ImageNet data set 44). The decoder consists of the mirrored layers of the encoder (with additional up-sampling layers), with skip connections between the corresponding layers. Each skip connection is followed by a Dropout layer with a rate fixed empirically to 0.3. The whole architecture is trained using uniquely the JSRT data set (with the corresponding segmentation masks) for a total of 100 epochs, with a fixed learning rate of $10^{-4}$. The Focal loss 45 is used in this case to optimise the whole architecture with an Adam optimiser. The classification results depicted in Table 5 clearly show that the overall performance of the semi-supervised cardiomegaly detection architecture is substantially better than the one of the detection approach based on U-Net models. These results reinforce the previously stated assumption that the diversity of the data set used to optimise a classification model plays a crucial role in its generalisation ability, since the models trained in a semi-supervised manner are able to improve the feature representations generated by the encoder by using additional information stemming [...]

Discussion
The presented results clearly show that the performance of a segmentation model can be substantially improved by using information stemming from unlabeled samples. In the current work, cross-consistency training has proven to be a simple and effective semi-supervised training approach. A significant performance improvement of the cardiomegaly detection architecture could be achieved by using models trained in a semi-supervised manner, in comparison to using models trained in a supervised manner. Thus, an effective integration of information stemming from unlabeled samples can significantly reduce the amount of labeled samples required in order to achieve high classification performances, therefore significantly reducing the costs of manual annotation. Additionally, the results of the subsequent experiments show that the proposed segmentation-based cardiomegaly detection approach outperforms the classification-based approach (based on image-level labels) in a cross-domain setting. Moreover, segmentation-based cardiomegaly detection approaches proved to be more robust than classification-based cardiomegaly detection approaches regarding the domain shift observed while performing the detection in a cross-domain setting. Furthermore, since the output of the segmentation models can be easily plotted and visualized, the resulting classification results can be easily interpreted, therefore bringing more clarity to the generated predictions and allowing the identification of the detection architecture's flaws and limitations. This is particularly relevant in a clinical setting, where a visualisation of the automatically generated segmentation provides more insights than the results of the classification or detection task alone. Finally, even though the experiments were performed in a challenging cross-domain setting, the yielded results point to a good generalisation ability of the proposed architecture. Previous works generally focus on single data sets and report similar results 15,19. However, such approaches suffer from the domain shift when applied on data sets stemming from other centers, resulting in sub-optimal performances. Thus, the diversity of the data sets used to perform the optimisation of the segmentation models plays a significant role in the resulting generalisation ability of the optimised models.

Conclusion
In summary, cross-consistency training proved to be very effective: the segmentation models trained in a semi-supervised setting significantly improved the performance of the cardiomegaly detection architecture in comparison to the models trained in a supervised manner. The diversity of the data sets used for the optimisation of the segmentation models positively impacted the generalisation ability of the detection architecture in a cross-domain setting. The segmentation-based approach further improves the interpretability of the generated results, which is of utmost importance in a clinical setting. However, the performance of the proposed architecture can likely be further improved by enhancing the heart segmentation model, since the observed misclassifications were mostly due to inaccurate segmentation of this specific area of interest. Future work could assess other forms of perturbations applied at different levels of granularity within the segmentation models, as well as other semi-supervised learning approaches for the optimisation of the segmentation models 46. https://doi.org/10.1038/s41598-024-56079-1

Figure 1. Segmentation-based cardiomegaly detection architecture. Two distinct segmentation models are applied on an input image to perform the segmentation of both cardiac and lungs' areas. Bounding boxes around the areas of interest are subsequently computed based on the resulting segmentation masks. The CTR score is calculated based on the widths of the respective cardiac and lungs' bounding boxes. Based on a fixed threshold (π) and the computed CTR score, the input image is finally classified either as a case of cardiomegaly or as normal.
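The post-processing described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes binary NumPy masks of identical shape for the heart and the lungs. The function names, the default threshold of 0.5, and the use of a strict inequality are illustrative assumptions and are not taken from the caption, which only states that a fixed threshold is used.

```python
import numpy as np

def bbox_width(mask: np.ndarray) -> int:
    """Width in pixels of the tight bounding box around a binary mask."""
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain foreground
    if cols.size == 0:
        return 0  # empty mask: no bounding box
    return int(cols[-1] - cols[0] + 1)

def cardiothoracic_ratio(heart_mask: np.ndarray, lung_mask: np.ndarray) -> float:
    """CTR as the ratio of cardiac to thoracic bounding-box widths."""
    thoracic_width = bbox_width(lung_mask)
    if thoracic_width == 0:
        raise ValueError("lung mask is empty, CTR is undefined")
    return bbox_width(heart_mask) / thoracic_width

def is_cardiomegaly(heart_mask, lung_mask, threshold: float = 0.5) -> bool:
    # threshold=0.5 and the strict '>' are assumptions made for this sketch
    return cardiothoracic_ratio(heart_mask, lung_mask) > threshold
```

Because the decision depends only on two bounding-box widths, the masks and boxes can be drawn on the radiograph to show how a given classification was reached.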

Figure 2. Semi-supervised segmentation approach. The architecture consists of an encoder $E$, a main decoder $D$ and an auxiliary decoder $D_{aux}$. During each iteration, labeled samples ($x_i^l$) and unlabeled samples ($x_j^u$) are fed into the shared encoder. The resulting representations ($z_i^l$ and $z_j^u$) are subsequently fed into the main decoder, which generates the segmentation masks for both labeled and unlabeled images ($\hat{y}_i^l$ and $\hat{y}_j^u$). Concurrently, a set of $k$ distinctive perturbations ($P$) are applied to the latent representations specific to the unlabeled samples ($z_j^u$), and the resulting representations ($\{\hat{z}_j^{u,d}\}_{1 \le d \le k}$) are fed into the auxiliary decoder. The resulting set of auxiliary segmentation masks ($\{\hat{y}_j^{u,d}\}_{1 \le d \le k}$) are used in combination with the corresponding output of the main decoder ($\hat{y}_j^u$) to compute an unsupervised loss ($\mathcal{L}_U$), while the supervised loss ($\mathcal{L}_S$) is calculated based on the labeled samples' output of the main decoder ($\hat{y}_i^l$) and the corresponding labels ($y_i^l$).
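The data flow in the Figure 2 caption can be sketched as a single loss computation. This is a simplified NumPy illustration and not the authors' implementation: the encoder, decoders and perturbations are passed in as plain callables, and several choices are assumptions made for the sketch, namely binary cross-entropy for the supervised loss, mean squared error for the unsupervised consistency loss, and the `weight` parameter that combines the two.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-8):
    """Mean pixel-wise binary cross-entropy (assumed form of the supervised loss)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1 - labels) * np.log(1 - probs)))

def cct_loss(encoder, decoder, aux_decoder, perturbations,
             x_l, y_l, x_u, weight=1.0):
    """Return (total, supervised, unsupervised) losses for one iteration."""
    # Supervised branch: labeled samples go through encoder and main decoder.
    z_l = encoder(x_l)
    loss_s = cross_entropy(decoder(z_l), y_l)

    # Unsupervised branch: the main decoder output is the consistency target.
    z_u = encoder(x_u)
    y_u = decoder(z_u)

    # Each of the k perturbations yields one auxiliary prediction, which is
    # compared with the main prediction (MSE is an assumption of this sketch).
    aux_losses = [np.mean((aux_decoder(p(z_u)) - y_u) ** 2)
                  for p in perturbations]
    loss_u = float(np.mean(aux_losses))

    return loss_s + weight * loss_u, loss_s, loss_u
```

In a real training loop the callables would be differentiable network modules, and the main decoder's prediction on unlabeled samples would typically be treated as a fixed target when back-propagating the unsupervised term.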
of 65.38%; in a semi-supervised learning setting, however, the lungs' segmentation model achieves an averaged Jaccard score of 89.88%, while the heart's segmentation model achieves an averaged Jaccard score of 76.87%.
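For reference, the Jaccard score of a predicted mask is the intersection over union with its ground-truth mask. The sketch below assumes binary NumPy masks and assumes that "averaged" means the mean over per-image scores. The function names, and the convention of returning 1.0 when both masks are empty, are illustrative choices.

```python
import numpy as np

def jaccard_score(pred, target) -> float:
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, target).sum() / union)

def averaged_jaccard(preds, targets) -> float:
    """Mean of the per-image Jaccard scores (assumed averaging scheme)."""
    return float(np.mean([jaccard_score(p, t) for p, t in zip(preds, targets)]))
```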

Figure 3. Semi-supervised segmentation results (CXR Ulm). The top row shows the segmentation models' output (filled areas) and the ground truth (contours). The bottom row shows the same set of images, this time with only the segmentation models' output and the computed bounding boxes around the areas of interest.

Figure 4. Confusion matrices. The label 0 corresponds to normal CXR images, while the label 1 corresponds to cases of cardiomegaly.
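A confusion matrix with this label convention can be built as in the following sketch. The orientation (rows as true labels, columns as predicted labels) is an assumption, since the figure itself is not reproduced here, and the function name is illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred) -> np.ndarray:
    """2x2 count matrix: rows are true labels, columns are predicted labels.

    Label 0 = normal, label 1 = cardiomegaly.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # unbuffered add handles repeated index pairs
    return cm
```

With this orientation, `cm[0, 0]` counts true negatives, `cm[0, 1]` false positives, `cm[1, 0]` false negatives and `cm[1, 1]` true positives.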

Table 1. Data distribution. Number of image samples specific to each class, for each of the data sets.

Table 3. Cardiomegaly detection performance. The numbers in bold depict the best overall performance across all evaluated approaches.

Table 4. Classification approach vs. segmentation approach. The performances are depicted in terms of geometric mean (G-Mean). The numbers in bold depict the best overall G-Mean performance across all evaluated approaches.
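As a reminder of the metric, the G-Mean of a binary classifier is commonly defined as the geometric mean of sensitivity and specificity. The sketch below assumes that standard definition and takes the four confusion-matrix counts as inputs. The function name and the handling of empty classes (rate set to 0.0) are illustrative choices.

```python
import math

def g_mean(tn: int, fp: int, fn: int, tp: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    # Sensitivity: fraction of cardiomegaly cases that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of normal cases that were recognised as normal.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

Unlike accuracy, this score drops to zero when either class is entirely misclassified, which makes it suitable for comparing approaches on imbalanced data sets.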

Table 5.
a diverse set of unlabeled CXR images, resulting in improved segmentation outputs (and thus improved cardiomegaly detection results).